A Portable Algorithm for Mapping Bitext Correspondence
Abstract
The first step in most empirical work in multilingual NLP is to construct maps of the correspondence between texts and their translations (bitext maps). The Smooth Injective Map Recognizer (SIMR) algorithm presented here is a generic pattern recognition algorithm that is particularly well-suited to mapping bitext correspondence. SIMR is faster and significantly more accurate than other algorithms in the literature. The algorithm is robust enough to use on noisy texts, such as those resulting from OCR input, and on translations that are not very literal. SIMR encapsulates its language-specific heuristics, so that it can be ported to any language pair with a minimal effort.

1 Introduction

Texts that are available in two languages (bitexts) are immensely valuable for many natural language processing applications.¹ Bitexts are the raw material from which translation models are built. In addition to their use in machine translation (Sato & Nagao, 1990; Brown et al., 1993; Melamed, 1997), translation models can be applied to machine-assisted translation (Sato, 1992; Foster et al., 1996), cross-lingual information retrieval (SIGIR, 1996), and gisting of World Wide Web pages (Resnik, 1997). Bitexts also play a role in less automated applications such as concordancing for bilingual lexicography (Catizone et al., 1993; Gale & Church, 1991b), computer-assisted language learning, and tools for translators (e.g. (Macklovitch, 1995; Melamed, 1996b)). However, bitexts are of little use without an automatic method for constructing bitext maps.

¹ "Multitexts" in more than two languages are even more valuable, but they are much more rare.

Bitext maps identify corresponding text units between the two halves of a bitext. The ideal bitext mapping algorithm should be fast and accurate, use little memory and degrade gracefully when faced with translation irregularities like omissions and inversions. It should be applicable to any text genre in any pair of languages.
The Smooth Injective Map Recognizer (SIMR) algorithm presented in this paper is a bitext mapping algorithm that advances the state of the art on these criteria. The evaluation in Section 5 shows that SIMR's error rates are lower than those of other bitext mapping algorithms by an order of magnitude. At the same time, its expected running time and memory requirements are linear in the size of the input, better than any other published algorithm.

The paper begins by laying down SIMR's geometric foundations and describing the algorithm. Then, Section 4 explains how to port SIMR to arbitrary language pairs with minimal effort, without relying on genre-specific information such as sentence boundaries. The last section offers some insights about the optimal level of text analysis for mapping bitext correspondence.

2 Bitext Geometry

A bitext (Harris, 1988) comprises two versions of a text, such as a text in two different languages. Translators create a bitext each time they translate a text. Each bitext defines a rectangular bitext space, as illustrated in Figure 1. The width and height of the rectangle are the lengths of the two component texts, in characters. The lower left corner of the rectangle is the origin of the bitext space and represents the two texts' beginnings. The upper right corner is the terminus and represents the texts' ends. The line between the origin and the
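
The rectangular bitext space described in Section 2 can be illustrated with a minimal Python sketch. This is not code from the paper: the class name `BitextSpace`, its methods, and the choice of which text goes on which axis are assumptions made only for illustration. The sketch shows that the rectangle's dimensions are the character lengths of the two texts, that the origin is the lower left corner, and that the terminus is the upper right corner.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BitextSpace:
    """Hypothetical model of a rectangular bitext space.

    Both axes are measured in characters. Which text is placed on
    which axis is an assumption; the source does not specify it.
    """
    width: int   # length of the text on the x-axis, in characters
    height: int  # length of the text on the y-axis, in characters

    @classmethod
    def from_texts(cls, text_x: str, text_y: str) -> "BitextSpace":
        # The rectangle's sides are the character lengths of the two texts.
        return cls(width=len(text_x), height=len(text_y))

    @property
    def origin(self) -> Tuple[int, int]:
        # Lower left corner: the beginnings of both texts.
        return (0, 0)

    @property
    def terminus(self) -> Tuple[int, int]:
        # Upper right corner: the ends of both texts.
        return (self.width, self.height)

    def contains(self, x: int, y: int) -> bool:
        # True if the character-position pair lies inside the rectangle.
        return 0 <= x <= self.width and 0 <= y <= self.height


space = BitextSpace.from_texts("The cat sleeps.", "Le chat dort.")
print(space.origin, space.terminus)
```

Any pair of character positions, one from each text, is then a point inside this rectangle, which is the coordinate system in which a bitext map would be drawn.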